Let’s call the help function on library and see what happens.

help(library)

You can add a new chunk of code by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

Let’s take a quick look at how our notebook looks right now.

We’re going to cover:

Tidyverse - Packages for data wranglers

Tibble - Tibbles as dataframes

stringr - managing strings

Tidyr - the reshaping verbs:

  - pivot_longer: pivot wide data into long format (i.e. “melt”)
  - pivot_wider: pivot long data into wide format (i.e. “cast”)
  - separate: separate (i.e. split) one column into multiple columns
  - unite: unite (i.e. combine) multiple columns into one

Regressions:

Output:

Tidyverse

So, as it turns out, you’ve already been working with a package within the Tidyverse!

tidyverse is a collection of packages. Many data scientists use it - essentially, it’s a set of packages written in a similar style, with a shared philosophy for approaching data structures.

This book is an additional resource to help you splice, groupby, mutate and dice up your data.

You can install this with

#install.packages("tidyverse")

and then load it up with

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.4.2     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.1
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
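Those conflict messages matter in practice: after loading the tidyverse, a bare filter() call means dplyr::filter(). Here is a quick base-R sketch of how to reach the masked function:

```r
# After library(tidyverse), dplyr::filter() masks stats::filter().
# Use the package prefix to call the masked version explicitly:
x <- c(1, 2, 3, 4)

ma <- stats::filter(x, rep(1/2, 2))  # stats::filter computes a moving average
class(ma)                            # it returns a time series ("ts") object

# dplyr's filter(), by contrast, subsets rows of a data frame:
# dplyr::filter(mtcars, cyl == 6)
```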

Once you have this loaded, the packages listed in the startup message above are attached.

Installing the tidyverse also installs several import-focused packages (such as haven, for SAS, SPSS, and Stata files) that are loaded on demand rather than attached automatically:

Recall from last class:

Haven commands:

SAS:

read_sas("mtcars.sas7bdat")
write_sas(mtcars, "mtcars.sas7bdat")

SPSS:

read_sav("mtcars.sav")
write_sav(mtcars, "mtcars.sav")

Stata:

read_dta("mtcars.dta")
write_dta(mtcars, "mtcars.dta")

Readr (export/import) commands:

jobs_2 = read_csv('job-automation-probability.csv')
jobs_2 = read_delim('job-automation-probability.csv', delim = ',')
write.csv(jobs, 'jobs2.csv')

A cheat sheet exists for A LOT of packages - you can take a look here

Tibble

Let’s take a look at tibble. It is just a dataframe. Really not much to it.

The thing is, at first you will be annoyed: there will be a lot of errors you have to confront. The idea is that instead of R silently guessing for you (for example, coercing a column’s type on its own), tibble shows an error.

This is good, because it forces you to deal with issues in your data now - instead of hitting a mysterious error later, deep inside some package or function, with no idea where it came from.
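Here’s a minimal sketch of that strictness (assuming the tibble package is installed; the toy column name is ours). Base data frames silently partial-match column names with $, while tibbles refuse to guess:

```r
library(tibble)

df <- data.frame(value = 1:3)
df$val        # base R partial-matches "val" to "value" and silently returns 1 2 3

tb <- tibble(value = 1:3)
tb$val        # tibble warns "Unknown or uninitialised column" and returns NULL
```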

Let’s import the dataset that we’ll use today - growth rate by country - using the R package WDI. Be sure to install it first with install.packages("WDI")

Honestly, a tibble is pretty much a pandas data frame. It displays the first 10 rows and takes your data as is, so that you can work with it from the ground up.

#install.packages(c("tidyverse", "WDI"))
#notice that you can install several packages in one line by passing a character vector
library("WDI")
wdi <- WDI(country = "all", start = 2000, end = 2015, extra = TRUE,
           indicator = c("NY.GDP.MKTP.KD.ZG"))

Search for indicators that contain the word ‘gdp’

gdp_indic <- WDIsearch('gdp')
wdi <-as_tibble(wdi)

#it's just that easy

A lot more about tibble can be found here

We can extract columns

In Python this would be df.columnname. In R, you use the dollar sign:

#wdi$NY.GDP.MKTP.KD.ZG

Notice that RStudio shows you all of the columns after you type the dollar sign.

dplyr plays nicely with tibble. Let’s rename a column:

wdi <- rename(wdi, gdp = 'NY.GDP.MKTP.KD.ZG')
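Since the WDI download needs an internet connection, here is the same rename pattern on a toy tibble (the toy data is ours); note the argument order is always new_name = old_name:

```r
library(dplyr)
library(tibble)

d <- tibble(NY.GDP.MKTP.KD.ZG = c(2.1, 3.4))
d <- rename(d, gdp = NY.GDP.MKTP.KD.ZG)  # new_name = old_name
names(d)                                 # the column is now called "gdp"
```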

Stringr

So, now that you are using tibble, you’re going to need to know some basics of how to deal with strings.

It’s particularly useful when you want to search or filter within a string.

In Python, if you assign an item a string, you can split, filter, etc. Typically, in pandas, we assign a column a string type or tell pandas we want string operations on a column (like df['columnname'].str.replace(',','')).

This is not so dissimilar in R.

We use the function filter and apply the string-detect function to search for a particular word (in this example, “World”). Within a pipe, you can refer to the column by its bare name:

wdi %>% filter(str_detect(country, "World"))

You can use these options to help you match specific cases

. to match any character (except a newline)

^ to match the start of the string.

$ to match the end of the string.

So, if we want countries that start with “United”, we can do this as:

wdi %>% filter(str_detect(country, "^United"))

Similar to pandas, you’ll often need to clean a column of strings or other symbols (particularly if it’s something you want to be integers). You’ll use the str_replace or str_replace_all

str_replace_all(df$columnname, "things to replace", "replace it with something")

#Don't forget to assign it!
wdi$gdp <- str_replace_all(wdi$gdp, "[,$!]", "")
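A self-contained sketch of the same pattern (toy strings ours): stripping dollar signs and commas so the values can become numeric:

```r
library(stringr)

prices <- c("$1,200", "$950", "$2,300,000")
clean  <- str_replace_all(prices, "[,$]", "")  # one character class handles both symbols

as.numeric(clean)   # c(1200, 950, 2300000)
```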

There’s a whole chapter on this that you should read here

Tidy Data

R follows a set of conventions that makes one layout of tabular data much easier to work with. This format is usually called “tidy” (often “long”) data, and it will always have:

  1. Each variable in the data set is placed in its own column
  2. Each observation is placed in its own row
  3. Each value is placed in its own cell

So, generally, you’re trying to get your data into that shape. Luckily, there are a few commands to help you with this.

There are lots of ways to do this, though: reshape(), melt() [from wide to long], cast() [from long to wide], gather()[from wide to long], spread()[from long to wide]

But, we’ll learn to transform data with pivot_longer and pivot_wider
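Before applying them to the WDI data, a round trip on a tiny tibble (toy data of ours) shows what the two verbs do to the shape:

```r
library(tidyr)
library(tibble)

long <- tibble(country = c("A", "A", "B", "B"),
               year    = c(2000, 2001, 2000, 2001),
               gdp     = c(1.0, 1.5, 2.0, 2.5))

wide <- pivot_wider(long, names_from = year, values_from = gdp)  # 2 rows, year columns

back <- pivot_longer(wide, cols = `2000`:`2001`,                 # back to 4 rows
                     names_to = "year", values_to = "gdp")
```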

Pivot wider

Now, let’s try to pivot our data to the wide format.

When we pivot wider, it’s going to look something like this:

wdi_wide <- wdi %>% pivot_wider(names_from=year,values_from=gdp)
wdi_wide

Pivot Longer

We can transform the data into long format with the function pivot_longer.

It’ll look something like this:

Before I pivot, I’m going to check the names of my columns to see what columns I need to transform.

names(wdi_wide)
##  [1] "iso2c"     "country"   "iso3c"     "region"    "capital"   "longitude"
##  [7] "latitude"  "income"    "lending"   "2009"      "2005"      "2010"     
## [13] "2006"      "2007"      "2012"      "2011"      "2000"      "2014"     
## [19] "2013"      "2002"      "2008"      "2004"      "2015"      "2003"     
## [25] "2001"
pivot_longer(wdi_wide, cols = `2009`:`2001`, names_to = 'year', values_to = 'gdp')

Separate

Another handy verb to learn in R is separate.

Let’s say you have dates from which you want to create multiple columns. You can use the function separate to split one column into distinct columns in R.

#include Python code example

stocks = read.csv("all_stocks_5yr.csv")
date_sep <- stocks %>% separate(date, into = c("year", "month", "day"), sep = '-')

#the arrow denotes assigning an object - the direction of the arrow matters, but the assignment can sit before or after the expression. The line below works exactly the same as the code above.
stocks %>% separate(date, into = c("year", "month", "day"), sep = '-') -> date_sep

#equivalent to
date_sep <- stocks %>% separate(date, into = c("year", "month", "day"), sep = '-')

date_sep

Unite

separate’s sibling is unite - it combines multiple columns into one.

date_sep %>% unite(date, c("month", "day", "year"), sep = "/")
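Since the stocks file isn’t bundled here, the same separate/unite round trip on two made-up dates (data ours):

```r
library(tidyr)

d <- data.frame(date = c("2013-02-08", "2013-02-11"))

parts    <- separate(d, date, into = c("year", "month", "day"), sep = "-")
rejoined <- unite(parts, date, c("month", "day", "year"), sep = "/")

rejoined$date   # "02/08/2013" "02/11/2013"
```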

Regressions

Now let’s get to running those regressions.

The general format is that you will specify the model as the function and inside that function you will define the regression model that you want to run.

Stata’s “reg” is R’s “lm” which stands for linear model and is at the core of regression analysis.

The model will look something like this: lm(y ~ x1 + x2 + x3 + ..., data = df)

You can also call each of the columns within a dataframe like this: lm(df$y ~ df$x1 + df$x2 + df$x3 + ...). But I prefer the simplicity of the data argument.
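A self-contained sketch of the syntax on the built-in mtcars data (our example, not the jobs dataset used below):

```r
# mpg regressed on weight and horsepower, using the data= form
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)   # intercept plus one slope per regressor
```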

Let’s try running a basic OLS regression with our jobs dataset.

jobs = read.csv('/Users/mkaltenberg/Documents/GitHub/Data_Analysis_Python_R/Becoming a Viz Kid/job-automation-probability.csv')

ols <- lm(prob ~ average_ann_wage + numbEmployed , data = jobs)

Alright, so we got a regression!

We can view some of the results in the stored object. Or let’s dig into it with the function summary()

summary(ols)
## 
## Call:
## lm(formula = prob ~ average_ann_wage + numbEmployed, data = jobs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72928 -0.24991  0.06363  0.24745  0.92316 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.213e-01  2.622e-02  35.130   <2e-16 ***
## average_ann_wage -6.971e-06  4.031e-07 -17.293   <2e-16 ***
## numbEmployed      1.168e-08  2.761e-08   0.423    0.672    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3078 on 699 degrees of freedom
## Multiple R-squared:  0.3029, Adjusted R-squared:  0.3009 
## F-statistic: 151.8 on 2 and 699 DF,  p-value: < 2.2e-16

That’s better! Ok, so, we can see all of our general statistics here. We can also view specific parts by using the dollar sign to indicate a part of the output we want to view

summary(ols)$coefficients
##                       Estimate   Std. Error     t value      Pr(>|t|)
## (Intercept)       9.212632e-01 2.622403e-02  35.1304979 1.484084e-156
## average_ann_wage -6.970858e-06 4.030936e-07 -17.2933997  4.776142e-56
## numbEmployed      1.167886e-08 2.761272e-08   0.4229523  6.724601e-01
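A few other pieces you can pull out the same way (mtcars example, ours):

```r
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

s$r.squared                     # just the R-squared
s$coefficients[, "Pr(>|t|)"]    # just the p-values column
```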

You can run the regression on a subset of the data using filter and grepl. NOTE how the filter is nested inside the parentheses of the data argument.

The equivalent in Stata would be: reg prob average_ann_wage numbEmployed if education != "No formal educational credential"

ols2 <- lm(prob ~ average_ann_wage + numbEmployed , data = jobs %>% filter(!(grepl("No formal educational credential", education))))
summary(ols2)
## 
## Call:
## lm(formula = prob ~ average_ann_wage + numbEmployed, data = jobs %>% 
##     filter(!(grepl("No formal educational credential", education))))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69748 -0.27219  0.03687  0.27719  0.90587 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.863e-01  3.016e-02  29.383   <2e-16 ***
## average_ann_wage -6.556e-06  4.447e-07 -14.743   <2e-16 ***
## numbEmployed     -1.239e-08  4.216e-08  -0.294    0.769    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3202 on 601 degrees of freedom
## Multiple R-squared:  0.2656, Adjusted R-squared:  0.2632 
## F-statistic: 108.7 on 2 and 601 DF,  p-value: < 2.2e-16

This is the same as this format (reminder of pipes and how we can use them)

jobs_noedu =
  jobs %>% 
  filter(education != "No formal educational credential")
  # filter(!(grepl("No formal educational credential", education))) ## This also works 

ols2 <- lm(prob ~ average_ann_wage + numbEmployed , data = jobs_noedu)
summary(ols2)
## 
## Call:
## lm(formula = prob ~ average_ann_wage + numbEmployed, data = jobs_noedu)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69748 -0.27219  0.03687  0.27719  0.90587 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       8.863e-01  3.016e-02  29.383   <2e-16 ***
## average_ann_wage -6.556e-06  4.447e-07 -14.743   <2e-16 ***
## numbEmployed     -1.239e-08  4.216e-08  -0.294    0.769    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3202 on 601 degrees of freedom
## Multiple R-squared:  0.2656, Adjusted R-squared:  0.2632 
## F-statistic: 108.7 on 2 and 601 DF,  p-value: < 2.2e-16

Robust standard errors

Often (like REALLY REALLY often) we want to include robust standard errors. In Stata, this is an option in the command reg.

White Standard Error Adjustment (Robust standard errors) in Stata: reg prob average_ann_wage numbEmployed, r

In R, it’s an entirely new function. So, let’s load the package estimatr

library(estimatr)

ols1_robust = lm_robust(prob ~ average_ann_wage + numbEmployed , data = jobs)
summary(ols1_robust)
## 
## Call:
## lm_robust(formula = prob ~ average_ann_wage + numbEmployed, data = jobs)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                    Estimate Std. Error  t value   Pr(>|t|)   CI Lower
## (Intercept)       9.213e-01  2.860e-02  32.2078 3.044e-140  8.651e-01
## average_ann_wage -6.971e-06  4.705e-07 -14.8151  2.182e-43 -7.895e-06
## numbEmployed      1.168e-08  2.046e-08   0.5708  5.683e-01 -2.849e-08
##                    CI Upper  DF
## (Intercept)       9.774e-01 699
## average_ann_wage -6.047e-06 699
## numbEmployed      5.185e-08 699
## 
## Multiple R-squared:  0.3029 ,    Adjusted R-squared:  0.3009 
## F-statistic: 113.5 on 2 and 699 DF,  p-value: < 2.2e-16

A better way to do this, so that the model can be incorporated into stargazer, is the sandwich package. estimatr for some reason doesn’t play well with stargazer.

library(sandwich) 
library(lmtest) # to use coeftest
## Warning: package 'lmtest' was built under R version 4.0.5
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
# vcovHC   is the heteroskedasticity consistent estimation of the covariance matrix of the coefficient estimates


coeftest(ols, vcov = vcovHC(ols, type="HC0"))
## 
## t test of coefficients:
## 
##                     Estimate  Std. Error  t value Pr(>|t|)    
## (Intercept)       9.2126e-01  2.8264e-02  32.5951   <2e-16 ***
## average_ann_wage -6.9709e-06  4.6296e-07 -15.0571   <2e-16 ***
## numbEmployed      1.1679e-08  2.0139e-08   0.5799   0.5622    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#This gives you the robust standard errors

Binary Variables

Another thing you may want to do is include a dummy variable in your regression.

Generally, we treat these as factors. In Stata, you can include factors as i.dummy_variable

R makes this pretty easy - it automatically knows that you are using a string variable and will create categorical variables (factors) out of that.

ols_edu = lm_robust(prob ~ average_ann_wage + numbEmployed + factor(education), data = jobs)
summary(ols_edu)
## 
## Call:
## lm_robust(formula = prob ~ average_ann_wage + numbEmployed + 
##     factor(education), data = jobs)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                         6.311e-01  5.882e-02
## average_ann_wage                                   -2.834e-06  5.229e-07
## numbEmployed                                       -5.931e-09  2.179e-08
## factor(education)Bachelor's degree                 -1.842e-01  5.631e-02
## factor(education)Doctoral or professional degree   -1.834e-01  7.130e-02
## factor(education)High school diploma or equivalent  1.998e-01  5.262e-02
## factor(education)Master's degree                   -2.956e-01  6.292e-02
## factor(education)No formal educational credential   2.471e-01  5.597e-02
## factor(education)Postsecondary nondegree award     -1.187e-02  6.813e-02
## factor(education)Some college, no degree            1.619e-01  1.297e-01
##                                                    t value  Pr(>|t|)   CI Lower
## (Intercept)                                        10.7300 5.952e-25  5.156e-01
## average_ann_wage                                   -5.4190 8.281e-08 -3.860e-06
## numbEmployed                                       -0.2722 7.856e-01 -4.872e-08
## factor(education)Bachelor's degree                 -3.2710 1.125e-03 -2.947e-01
## factor(education)Doctoral or professional degree   -2.5728 1.030e-02 -3.234e-01
## factor(education)High school diploma or equivalent  3.7970 1.593e-04  9.649e-02
## factor(education)Master's degree                   -4.6982 3.167e-06 -4.192e-01
## factor(education)No formal educational credential   4.4160 1.166e-05  1.373e-01
## factor(education)Postsecondary nondegree award     -0.1743 8.617e-01 -1.456e-01
## factor(education)Some college, no degree            1.2484 2.123e-01 -9.274e-02
##                                                      CI Upper  DF
## (Intercept)                                         7.466e-01 692
## average_ann_wage                                   -1.807e-06 692
## numbEmployed                                        3.685e-08 692
## factor(education)Bachelor's degree                 -7.363e-02 692
## factor(education)Doctoral or professional degree   -4.344e-02 692
## factor(education)High school diploma or equivalent  3.031e-01 692
## factor(education)Master's degree                   -1.721e-01 692
## factor(education)No formal educational credential   3.570e-01 692
## factor(education)Postsecondary nondegree award      1.219e-01 692
## factor(education)Some college, no degree            4.166e-01 692
## 
## Multiple R-squared:  0.4489 ,    Adjusted R-squared:  0.4417 
## F-statistic:  93.9 on 9 and 692 DF,  p-value: < 2.2e-16

Interaction variables

There is a convenient syntax on how to include interaction terms:

x1:x2 “crosses” the variables (equivalent to including only the x1 × x2 interaction term)
x1/x2 “nests” the second variable within the first (equivalent to x1 + x1:x2)
x1*x2 includes all parent and interaction terms (equivalent to x1 + x2 + x1:x2)

Almost always you’ll use * (all parent and interaction terms). There are situations in which you wouldn’t, but it’s rare.
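You can verify what each operator expands to by inspecting the design matrix with base R’s model.matrix (toy data ours):

```r
d <- data.frame(x1 = c(1, 2, 3, 4), x2 = c(5, 6, 7, 8))

colnames(model.matrix(~ x1:x2, d))  # "(Intercept)" "x1:x2"
colnames(model.matrix(~ x1/x2, d))  # "(Intercept)" "x1" "x1:x2"
colnames(model.matrix(~ x1*x2, d))  # "(Intercept)" "x1" "x2" "x1:x2"
```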

So, let’s check it out:

int = lm_robust(prob ~ average_ann_wage + numbEmployed*education, data = jobs)
summary(int)
## 
## Call:
## lm_robust(formula = prob ~ average_ann_wage + numbEmployed * 
##     education, data = jobs)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                                                           Estimate Std. Error
## (Intercept)                                              6.664e-01  7.856e-02
## average_ann_wage                                        -2.847e-06  5.350e-07
## numbEmployed                                            -5.138e-07  8.928e-07
## educationBachelor's degree                              -2.247e-01  7.686e-02
## educationDoctoral or professional degree                -2.147e-01  9.046e-02
## educationHigh school diploma or equivalent               1.742e-01  7.376e-02
## educationMaster's degree                                -2.848e-01  8.770e-02
## educationNo formal educational credential                2.022e-01  7.617e-02
## educationPostsecondary nondegree award                  -4.516e-02  9.087e-02
## educationSome college, no degree                         8.691e-03  1.601e-01
## numbEmployed:educationBachelor's degree                  5.455e-07  8.975e-07
## numbEmployed:educationDoctoral or professional degree    4.856e-07  9.151e-07
## numbEmployed:educationHigh school diploma or equivalent  4.514e-07  8.946e-07
## numbEmployed:educationMaster's degree                   -1.708e-07  9.778e-07
## numbEmployed:educationNo formal educational credential   5.334e-07  8.929e-07
## numbEmployed:educationPostsecondary nondegree award      4.995e-07  9.250e-07
## numbEmployed:educationSome college, no degree            6.669e-07  9.110e-07
##                                                          t value  Pr(>|t|)
## (Intercept)                                              8.48369 1.343e-16
## average_ann_wage                                        -5.32124 1.398e-07
## numbEmployed                                            -0.57550 5.651e-01
## educationBachelor's degree                              -2.92354 3.575e-03
## educationDoctoral or professional degree                -2.37356 1.789e-02
## educationHigh school diploma or equivalent               2.36178 1.847e-02
## educationMaster's degree                                -3.24796 1.219e-03
## educationNo formal educational credential                2.65441 8.129e-03
## educationPostsecondary nondegree award                  -0.49705 6.193e-01
## educationSome college, no degree                         0.05428 9.567e-01
## numbEmployed:educationBachelor's degree                  0.60777 5.435e-01
## numbEmployed:educationDoctoral or professional degree    0.53070 5.958e-01
## numbEmployed:educationHigh school diploma or equivalent  0.50455 6.140e-01
## numbEmployed:educationMaster's degree                   -0.17471 8.614e-01
## numbEmployed:educationNo formal educational credential   0.59745 5.504e-01
## numbEmployed:educationPostsecondary nondegree award      0.53996 5.894e-01
## numbEmployed:educationSome college, no degree            0.73201 4.644e-01
##                                                           CI Lower   CI Upper
## (Intercept)                                              5.122e-01  8.207e-01
## average_ann_wage                                        -3.898e-06 -1.797e-06
## numbEmployed                                            -2.267e-06  1.239e-06
## educationBachelor's degree                              -3.756e-01 -7.380e-02
## educationDoctoral or professional degree                -3.923e-01 -3.710e-02
## educationHigh school diploma or equivalent               2.938e-02  3.190e-01
## educationMaster's degree                                -4.570e-01 -1.126e-01
## educationNo formal educational credential                5.263e-02  3.518e-01
## educationPostsecondary nondegree award                  -2.236e-01  1.332e-01
## educationSome college, no degree                        -3.057e-01  3.231e-01
## numbEmployed:educationBachelor's degree                 -1.217e-06  2.308e-06
## numbEmployed:educationDoctoral or professional degree   -1.311e-06  2.282e-06
## numbEmployed:educationHigh school diploma or equivalent -1.305e-06  2.208e-06
## numbEmployed:educationMaster's degree                   -2.091e-06  1.749e-06
## numbEmployed:educationNo formal educational credential  -1.220e-06  2.286e-06
## numbEmployed:educationPostsecondary nondegree award     -1.317e-06  2.316e-06
## numbEmployed:educationSome college, no degree           -1.122e-06  2.456e-06
##                                                          DF
## (Intercept)                                             685
## average_ann_wage                                        685
## numbEmployed                                            685
## educationBachelor's degree                              685
## educationDoctoral or professional degree                685
## educationHigh school diploma or equivalent              685
## educationMaster's degree                                685
## educationNo formal educational credential               685
## educationPostsecondary nondegree award                  685
## educationSome college, no degree                        685
## numbEmployed:educationBachelor's degree                 685
## numbEmployed:educationDoctoral or professional degree   685
## numbEmployed:educationHigh school diploma or equivalent 685
## numbEmployed:educationMaster's degree                   685
## numbEmployed:educationNo formal educational credential  685
## numbEmployed:educationPostsecondary nondegree award     685
## numbEmployed:educationSome college, no degree           685
## 
## Multiple R-squared:  0.4527 ,    Adjusted R-squared:  0.4399 
## F-statistic: 56.23 on 16 and 685 DF,  p-value: < 2.2e-16

Output

There are a LOT of packages out there to export and display various statistical results. You can take a look at them here

Very few export directly to Word (flextable and huxtable do) - instead, you typically export HTML output and copy and paste it into Word or a presentation, or export a PDF from R Markdown. The reason so few packages export to Word is that it’s a nightmare of bugs. Also, most researchers using R/Python use LaTeX for formatting, so most packages export to LaTeX quite easily.

We will focus on two of them that make output in a variety of formats easy and pretty - stargazer and modelsummary.

Stargazer

Most people who use R for data analysis use stargazer as a way to present regression tables and summary statistics. It is by far the most popular package for this - possibly because it’s so easy to use and the output looks very nice.
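The chunk that produced the table below isn’t shown in this extract; a typical call looks like this (the two mtcars models are placeholders of ours, assuming stargazer is installed):

```r
library(stargazer)

# two illustrative models on the built-in mtcars data
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + factor(cyl), data = mtcars)

stargazer(m1, m2, type = "text", title = "My Data Models")
```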

## 
## My Data Models
## ===========================================================================================
##                                                          Dependent variable:               
##                                            ------------------------------------------------
##                                                                  prob                      
##                                                      (1)                      (2)          
## -------------------------------------------------------------------------------------------
## average_ann_wage                                 -0.00001***                               
##                                                   (0.00000)                                
##                                                                                            
## educationBachelor's degree                                                 -0.202***       
##                                                                             (0.049)        
##                                                                                            
## educationDoctoral or professional degree                                   -0.216***       
##                                                                             (0.080)        
##                                                                                            
## educationHigh school diploma or equivalent                                 0.201***        
##                                                                             (0.045)        
##                                                                                            
## educationMaster's degree                                                   -0.306***       
##                                                                             (0.067)        
##                                                                                            
## educationNo formal educational credential                                  0.250***        
##                                                                             (0.053)        
##                                                                                            
## educationPostsecondary nondegree award                                      -0.013         
##                                                                             (0.060)        
##                                                                                            
## educationSome college, no degree                                             0.142         
##                                                                             (0.146)        
##                                                                                            
## numbEmployed                                        0.000                   -0.000         
##                                                   (0.00000)                (0.00000)       
##                                                                                            
## median_ann_wage                                                           -0.00000***      
##                                                                            (0.00000)       
##                                                                                            
## Constant                                           0.921***                0.613***        
##                                                    (0.026)                  (0.052)        
##                                                                                            
## -------------------------------------------------------------------------------------------
## Observations                                         702                      702          
## R2                                                  0.303                    0.444         
## Adjusted R2                                         0.301                    0.436         
## Residual Std. Error                            0.308 (df = 699)        0.276 (df = 692)    
## F Statistic                                151.845*** (df = 2; 699) 61.304*** (df = 9; 692)
## ===========================================================================================
## Note:                                                           *p<0.1; **p<0.05; ***p<0.01

We can export the table into HTML with the option “out” - let’s check out what the file looks like after we export it.
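The chunk that writes the file isn’t shown in this extract; a typical call passes the filename via out (the model and filename here are placeholders of ours):

```r
library(stargazer)

m1 <- lm(mpg ~ wt, data = mtcars)  # illustrative model

# type = "html" renders the table as HTML; out = also writes it to a file
stargazer(m1, type = "html", out = "my_models.html", title = "My models")
```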

## 
## <table style="text-align:center"><caption><strong>My models</strong></caption>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="2"><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="2" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td colspan="2">prob</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">average_ann_wage</td><td>-0.00001<sup>***</sup></td><td></td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationBachelor's degree</td><td></td><td>-0.202<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.049)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationDoctoral or professional degree</td><td></td><td>-0.216<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.080)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationHigh school diploma or equivalent</td><td></td><td>0.201<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.045)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationMaster's degree</td><td></td><td>-0.306<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.067)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationNo formal educational credential</td><td></td><td>0.250<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.053)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationPostsecondary nondegree award</td><td></td><td>-0.013</td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.060)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">educationSome college, no degree</td><td></td><td>0.142</td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.146)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">numbEmployed</td><td>0.000</td><td>-0.000</td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">median_ann_wage</td><td></td><td>-0.00000<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>0.921<sup>***</sup></td><td>0.613<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.026)</td><td>(0.052)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>702</td><td>702</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.303</td><td>0.444</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.301</td><td>0.436</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>0.308 (df = 699)</td><td>0.276 (df = 692)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>151.845<sup>***</sup> (df = 2; 699)</td><td>61.304<sup>***</sup> (df = 9; 692)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="2" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>

We can see the regression table directly in the output when we knit the code in R Markdown.

There are many options within stargazer that we can play around with to get our tables "just right", and you will spend a lot of time doing this. Every researcher spends more time than they would like getting their tables correct.

A very useful cheat sheet on stargazer can be found here
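The relabeled table below can be produced with stargazer's labeling arguments. A sketch of such a call, assuming two fitted models named `fit1` and `fit2` (the model names are placeholders, not from the source):

```r
library(stargazer)

# Relabel the dependent variable, columns, and covariates.
# covariate.labels must be supplied in the same order as the
# coefficients appear in the default table.
stargazer(fit1, fit2,
          type = "html",                          # or "text" / "latex"
          title = "OLS Results",
          dep.var.labels = "Prob. of Automated Job",
          column.labels = c("OLS", "OLS"),
          covariate.labels = c("Avg. Salary", "BA", "PhD+", "HS", "MA",
                               "None", "Postsecond", "Some College",
                               "N emp", "Med Salary"))
```

With `type = "text"` you get a plain-text version of the same table in the console, which is handy for quick checks before knitting.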

## 
## <table style="text-align:center"><caption><strong>OLS Results</strong></caption>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="2"><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="2" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td colspan="2">Prob. of Automated Job</td></tr>
## <tr><td style="text-align:left"></td><td>OLS</td><td>OLS</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Avg. Salary</td><td>-0.00001<sup>***</sup></td><td></td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td></td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">BA</td><td></td><td>-0.202<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.049)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">PhD+</td><td></td><td>-0.216<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.080)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">HS</td><td></td><td>0.201<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.045)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">MA</td><td></td><td>-0.306<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.067)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">None</td><td></td><td>0.250<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.053)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Postsecond</td><td></td><td>-0.013</td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.060)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Some College</td><td></td><td>0.142</td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.146)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">N emp</td><td>0.000</td><td>-0.000</td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Med Salary</td><td></td><td>-0.00000<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td></td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">Constant</td><td>0.921<sup>***</sup></td><td>0.613<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.026)</td><td>(0.052)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>702</td><td>702</td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.303</td><td>0.444</td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.301</td><td>0.436</td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>0.308 (df = 699)</td><td>0.276 (df = 692)</td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>151.845<sup>***</sup> (df = 2; 699)</td><td>61.304<sup>***</sup> (df = 9; 692)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="2" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>

### Robust standard errors and stargazer

## 
## <table style="text-align:center"><caption><strong>OLS Results</strong></caption>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"></td><td colspan="2"><em>Dependent variable:</em></td></tr>
## <tr><td></td><td colspan="2" style="border-bottom: 1px solid black"></td></tr>
## <tr><td style="text-align:left"></td><td>Prob. of Automated Job</td><td></td></tr>
## <tr><td style="text-align:left"></td><td>OLS</td><td>OLS robust</td></tr>
## <tr><td style="text-align:left"></td><td>(1)</td><td>(2)</td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Avg. Salary</td><td>-0.00001<sup>***</sup></td><td>-0.00001<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">BA</td><td>0.000</td><td>0.000</td></tr>
## <tr><td style="text-align:left"></td><td>(0.00000)</td><td>(0.00000)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td style="text-align:left">PhD+</td><td>0.921<sup>***</sup></td><td>0.921<sup>***</sup></td></tr>
## <tr><td style="text-align:left"></td><td>(0.026)</td><td>(0.028)</td></tr>
## <tr><td style="text-align:left"></td><td></td><td></td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left">Observations</td><td>702</td><td></td></tr>
## <tr><td style="text-align:left">R<sup>2</sup></td><td>0.303</td><td></td></tr>
## <tr><td style="text-align:left">Adjusted R<sup>2</sup></td><td>0.301</td><td></td></tr>
## <tr><td style="text-align:left">Residual Std. Error</td><td>0.308 (df = 699)</td><td></td></tr>
## <tr><td style="text-align:left">F Statistic</td><td>151.845<sup>***</sup> (df = 2; 699)</td><td></td></tr>
## <tr><td colspan="3" style="border-bottom: 1px solid black"></td></tr><tr><td style="text-align:left"><em>Note:</em></td><td colspan="2" style="text-align:right"><sup>*</sup>p<0.1; <sup>**</sup>p<0.05; <sup>***</sup>p<0.01</td></tr>
## </table>
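A common way to build a table like the one above is to compute heteroskedasticity-robust standard errors with the sandwich package and pass them to stargazer via its `se` argument. A sketch, assuming a fitted model named `fit1` (the name is a placeholder):

```r
library(sandwich)
library(stargazer)

# HC1 robust standard errors: square roots of the diagonal of the
# heteroskedasticity-consistent covariance matrix.
robust_se <- sqrt(diag(vcovHC(fit1, type = "HC1")))

# Report the same model twice: once with default (classical) SEs,
# once with the robust SEs (NULL keeps the default for column 1).
stargazer(fit1, fit1,
          type = "text",
          title = "OLS Results",
          column.labels = c("OLS", "OLS robust"),
          se = list(NULL, robust_se))
```

Note that the coefficients are identical across the two columns; only the standard errors (and hence the stars) can change.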

# vtable

For quick summary statistics you can use vtable.

## Loading required package: kableExtra
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
**Summary Statistics**

| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| X_…rank | 702 | 351.5 | 202.794 | 1 | 176.25 | 526.75 | 702 |
| prob | 702 | 0.536 | 0.368 | 0.003 | 0.11 | 0.89 | 0.99 |
| Average.annual.wage | 702 | 55645.278 | 28985.949 | 20460 | 35757.5 | 66477.5 | 232870 |
| len | 702 | 33.479 | 17.486 | 6 | 19 | 44 | 92 |
| probability | 702 | 0.536 | 0.368 | 0.003 | 0.11 | 0.89 | 0.99 |
| numbEmployed | 702 | 186835.726 | 423140.174 | 370 | 17307.5 | 151702.5 | 4528550 |
| median_ann_wage | 702 | 51345.468 | 26912.979 | 19290 | 33355 | 61652.5 | 232870 |
| average_ann_wage | 702 | 55645.278 | 28985.949 | 20460 | 35757.5 | 66477.5 | 232870 |

There are also other options available with datasummary that you can check out here
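A table like the one above can be produced with vtable's `st()` function (an alias for `sumtable()`). A minimal sketch, where `df` is a placeholder name for the data frame:

```r
library(vtable)

# Summary statistics (N, mean, sd, min, percentiles, max) for every
# numeric column of the data frame.
st(df)

# Restrict the table to selected variables.
st(df, vars = c("prob", "average_ann_wage"))
```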

Breakout Group Exercise

  1. Import the dataset "WAGE1.DTA" (in the same folder as this lecture on GitHub) (hint: haven).
  2. Use OLS to estimate the effect education has on wages; be sure to include relevant controls and an appropriate functional form.
  3. Create a dummy variable for whether the person has a child (= 1 if yes, = 0 if none).
  4. Re-estimate Q2 using OLS, but include the dummy variable you created in Q3, and be sure that wages are logged.
  5. Create a stargazer table that includes the regressions from Q2 and Q4.
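A possible starting point for Q1, assuming the .dta file sits in your working directory (the path is an assumption):

```r
library(haven)

# read_dta() imports Stata .dta files as a tibble.
wage1 <- read_dta("WAGE1.DTA")
head(wage1)
```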